knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.19.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(brms)
## Loading required package: Rcpp
## Loading 'brms' package (version 2.12.0). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
## 
## Attaching package: 'brms'
## The following object is masked from 'package:rstan':
## 
##     loo
## The following object is masked from 'package:stats':
## 
##     ar
library(furrr)
## Loading required package: future
library(modelr)
library(tidybayes)
## 
## Attaching package: 'tidybayes'
## The following objects are masked from 'package:brms':
## 
##     dstudent_t, pstudent_t, qstudent_t, rstudent_t
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

# theme_set(
#   theme_minimal() +
#     theme(
#       axis.text = element_text(size = 12),
#       axis.title = element_text(size = 14),
#       axis.text.y = element_blank(),
#       axis.title.y = element_blank(),
#       strip.text = element_text(size = 14),
#       panel.spacing = unit(4, "lines")
#     )
# )

Exploratory analysis

Number of false positives in each between and within subjects condition

Tally the false positives by each condition. The graphs below show the distribution of the number of false positives in a single trial. We can see that most people make 2 or fewer false positives in each trial; however, we do not see much difference based on the number of regions shown to participants.
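
This tally can be sketched with dplyr, using a toy stand-in for the trial-level data (the column names below are assumptions about the underlying schema, not the study's actual data):

```r
library(dplyr)

# Toy trial-level results; `fp` is the false-positive count per trial
responses <- data.frame(
  condition = c("ci", "ci", "ci", "dotplot", "dotplot"),
  nregions  = c(8, 8, 12, 8, 12),
  fp        = c(1, 3, 0, 2, 1)
)

# Tally false positives in each between- (condition) and
# within-subjects (nregions) cell
fp_tally <- responses %>%
  group_by(condition, nregions) %>%
  summarise(mean_fp = mean(fp), n_trials = n(), .groups = "drop")
```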

Correctness of responses in each between and within subject condition

## # A tibble: 12 x 7
##    condition      nregions    tp    tn     fp    fn   fdr
##    <fct>             <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1 ci                    8 0.273 0.417 0.0795 0.230 0.225
##  2 ci                   12 0.267 0.432 0.0588 0.243 0.181
##  3 dotplot               8 0.325 0.406 0.0899 0.179 0.217
##  4 dotplot              12 0.311 0.427 0.0635 0.198 0.169
##  5 halfeye               8 0.295 0.423 0.0736 0.209 0.200
##  6 halfeye              12 0.279 0.442 0.0481 0.230 0.147
##  7 hops_bootstrap        8 0.184 0.439 0.0580 0.319 0.240
##  8 hops_bootstrap       12 0.177 0.445 0.0457 0.332 0.205
##  9 hops_mean             8 0.277 0.438 0.0579 0.227 0.173
## 10 hops_mean            12 0.275 0.449 0.0414 0.235 0.131
## 11 raw_data              8 0.242 0.430 0.0662 0.262 0.215
## 12 raw_data             12 0.238 0.437 0.0536 0.272 0.184

Visualising the average number of regions selected by participants

We explore whether there are differences in the number of regions selected by participants when the number of testable hypotheses changes (i.e., 8 or 12). We want to make sure that participants do not, on average, resort to a strategy of selecting a fixed number of regions regardless of the number of testable hypotheses.
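
A minimal sketch of this check, on invented data (the `n_selected` column, regions picked per trial, is an assumed name): if participants used a fixed-selection strategy, the two within-subjects means would coincide.

```r
library(dplyr)

# Toy selection counts per participant and within-subjects condition
selections <- data.frame(
  prolific_pid = c("p1", "p1", "p2", "p2"),
  nregions     = c(8, 12, 8, 12),
  n_selected   = c(2, 3, 1, 1)
)

# Average number of regions selected at 8 vs. 12 testable hypotheses
avg_selected <- selections %>%
  group_by(nregions) %>%
  summarise(mean_selected = mean(n_selected), .groups = "drop")
```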

Probability of TP/TN/FP/FN in each condition

In the following graph we show the mean values of TP, TN, FP and FN in each uncertainty visualization condition, separated by the number of graphs shown to each participant.

## # A tibble: 20 x 4
##    ntrials method      .category  .value
##      <int> <fct>       <chr>       <dbl>
##  1       8 bh          TP        0.204  
##  2       8 bh          TN        0.479  
##  3       8 bh          FP        0.0179 
##  4       8 bh          FN        0.3    
##  5       8 bh          FDR       0.0353 
##  6       8 uncorrected TP        0.336  
##  7       8 uncorrected TN        0.446  
##  8       8 uncorrected FP        0.05   
##  9       8 uncorrected FN        0.168  
## 10       8 uncorrected FDR       0.0934 
## 11      12 bh          TP        0.219  
## 12      12 bh          TN        0.483  
## 13      12 bh          FP        0.00714
## 14      12 bh          FN        0.290  
## 15      12 bh          FDR       0.0164 
## 16      12 uncorrected TP        0.329  
## 17      12 uncorrected TN        0.469  
## 18      12 uncorrected FP        0.0214 
## 19      12 uncorrected FN        0.181  
## 20      12 uncorrected FDR       0.0321
## [1] "check's out"

Modeling

The research questions in our study are:

  • RQ1: do users implicitly perform some form of multiple comparisons correction?
  • RQ2: do different types of uncertainty representations help users perform multiple comparisons correction by reducing the false discovery rate (FP / (FP + TP))?

The goal of our modeling is to estimate the probability of a TP / TN / FP / FN for a given (or average) trial, along with the associated uncertainty. Based on the results from our model, we attempt to answer our research questions.
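
As a concrete check of the FDR definition in RQ2, the fdr column of the exploratory table above can be reproduced from its tp and fp columns; the numbers below are the "ci", 8-regions row:

```r
# FDR as defined in RQ2: proportion of positive selections that are false
fdr <- function(fp, tp) fp / (fp + tp)

# tp and fp for the "ci" condition with 8 regions, from the table above
fdr(fp = 0.0795, tp = 0.273)  # ~0.225, matching the fdr column
```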

Multiple comparisons correction

We define the model and create the appropriate outcome column (y) for predicting multinomial outcomes: brms requires the outcome variable to be an n \(\times\) k matrix, where k is the number of categories and n is the number of responses (here, # of trials \(\times\) # of participants).
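
A minimal sketch of this data preparation, using a toy stand-in for the real data frame (the counts are invented for illustration; the formula matches the fit summary below, and brms is already attached above):

```r
# Toy stand-in for the trial-level data
df <- data.frame(
  tp = c(2, 3), tn = c(4, 5), fp = c(1, 0), fn = c(1, 4),
  ntrials = c(8, 12),
  condition = c("ci", "dotplot"),
  adj_trial_id = c(-0.5, 0.5),
  nregions = factor(c(8, 12)),
  prolific_pid = c("p1", "p2")
)

# brms wants the multinomial outcome as an n x k count-matrix column,
# where each row's counts sum to that trial's number of trials
df$y <- with(df, cbind(tp, tn, fp, fn))

# Model formula as reported in the fit summary; sampling takes a long
# time, so the call is shown but not evaluated here:
# fit <- brm(
#   y | trials(ntrials) ~ condition * adj_trial_id * nregions +
#     (adj_trial_id * nregions | prolific_pid),
#   family = multinomial(),
#   data = df
# )
```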

Model fit

##  Family: multinomial 
##   Links: mutn = logit; mufn = logit; mufp = logit 
## Formula: y | trials(ntrials) ~ condition * adj_trial_id * nregions + (adj_trial_id * nregions | prolific_pid) 
##    Data: df (Number of observations: 24920) 
## Samples: 4 chains, each with iter = 1250; warmup = 0; thin = 1;
##          total post-warmup samples = 5000
## 
## Group-Level Effects: 
## ~prolific_pid (Number of levels: 356) 
##                                                     Estimate Est.Error l-95% CI
## sd(mutn_Intercept)                                      0.56      0.02     0.51
## sd(mutn_adj_trial_id)                                   0.21      0.02     0.17
## sd(mutn_nregions12)                                     0.21      0.02     0.17
## sd(mutn_adj_trial_id:nregions12)                        0.18      0.04     0.09
## sd(mufn_Intercept)                                      0.96      0.04     0.89
## sd(mufn_adj_trial_id)                                   0.36      0.04     0.29
## sd(mufn_nregions12)                                     0.44      0.03     0.39
## sd(mufn_adj_trial_id:nregions12)                        0.53      0.05     0.43
## sd(mufp_Intercept)                                      0.83      0.04     0.75
## sd(mufp_adj_trial_id)                                   0.50      0.06     0.39
## sd(mufp_nregions12)                                     0.47      0.04     0.40
## sd(mufp_adj_trial_id:nregions12)                        1.07      0.09     0.90
## cor(mutn_Intercept,mutn_adj_trial_id)                   0.22      0.11     0.00
## cor(mutn_Intercept,mutn_nregions12)                    -0.08      0.09    -0.25
## cor(mutn_adj_trial_id,mutn_nregions12)                 -0.04      0.14    -0.31
## cor(mutn_Intercept,mutn_adj_trial_id:nregions12)        0.19      0.17    -0.15
## cor(mutn_adj_trial_id,mutn_adj_trial_id:nregions12)     0.20      0.20    -0.17
## cor(mutn_nregions12,mutn_adj_trial_id:nregions12)       0.65      0.15     0.31
## cor(mufn_Intercept,mufn_adj_trial_id)                   0.11      0.10    -0.09
## cor(mufn_Intercept,mufn_nregions12)                    -0.05      0.07    -0.19
## cor(mufn_adj_trial_id,mufn_nregions12)                 -0.18      0.10    -0.38
## cor(mufn_Intercept,mufn_adj_trial_id:nregions12)        0.06      0.10    -0.13
## cor(mufn_adj_trial_id,mufn_adj_trial_id:nregions12)    -0.07      0.14    -0.32
## cor(mufn_nregions12,mufn_adj_trial_id:nregions12)       0.67      0.08     0.49
## cor(mufp_Intercept,mufp_adj_trial_id)                   0.38      0.10     0.17
## cor(mufp_Intercept,mufp_nregions12)                    -0.02      0.10    -0.21
## cor(mufp_adj_trial_id,mufp_nregions12)                 -0.19      0.12    -0.42
## cor(mufp_Intercept,mufp_adj_trial_id:nregions12)       -0.15      0.10    -0.34
## cor(mufp_adj_trial_id,mufp_adj_trial_id:nregions12)    -0.63      0.08    -0.77
## cor(mufp_nregions12,mufp_adj_trial_id:nregions12)       0.50      0.10     0.29
##                                                     u-95% CI Rhat Bulk_ESS
## sd(mutn_Intercept)                                      0.60 1.00     3328
## sd(mutn_adj_trial_id)                                   0.26 1.00     3253
## sd(mutn_nregions12)                                     0.26 1.00     2473
## sd(mutn_adj_trial_id:nregions12)                        0.26 1.00     1483
## sd(mufn_Intercept)                                      1.04 1.00     3140
## sd(mufn_adj_trial_id)                                   0.44 1.00     3478
## sd(mufn_nregions12)                                     0.50 1.00     2692
## sd(mufn_adj_trial_id:nregions12)                        0.64 1.00     2175
## sd(mufp_Intercept)                                      0.91 1.00     4519
## sd(mufp_adj_trial_id)                                   0.61 1.00     3925
## sd(mufp_nregions12)                                     0.56 1.00     3751
## sd(mufp_adj_trial_id:nregions12)                        1.26 1.00     2758
## cor(mutn_Intercept,mutn_adj_trial_id)                   0.42 1.00     4618
## cor(mutn_Intercept,mutn_nregions12)                     0.11 1.00     4078
## cor(mutn_adj_trial_id,mutn_nregions12)                  0.24 1.00     1680
## cor(mutn_Intercept,mutn_adj_trial_id:nregions12)        0.51 1.00     4445
## cor(mutn_adj_trial_id,mutn_adj_trial_id:nregions12)     0.60 1.00     2979
## cor(mutn_nregions12,mutn_adj_trial_id:nregions12)       0.87 1.00     2731
## cor(mufn_Intercept,mufn_adj_trial_id)                   0.30 1.00     4296
## cor(mufn_Intercept,mufn_nregions12)                     0.09 1.00     4345
## cor(mufn_adj_trial_id,mufn_nregions12)                  0.02 1.00     1901
## cor(mufn_Intercept,mufn_adj_trial_id:nregions12)        0.25 1.00     4791
## cor(mufn_adj_trial_id,mufn_adj_trial_id:nregions12)     0.22 1.00     2478
## cor(mufn_nregions12,mufn_adj_trial_id:nregions12)       0.81 1.00     2536
## cor(mufp_Intercept,mufp_adj_trial_id)                   0.57 1.00     4119
## cor(mufp_Intercept,mufp_nregions12)                     0.18 1.00     4574
## cor(mufp_adj_trial_id,mufp_nregions12)                  0.05 1.00     3059
## cor(mufp_Intercept,mufp_adj_trial_id:nregions12)        0.03 1.00     4325
## cor(mufp_adj_trial_id,mufp_adj_trial_id:nregions12)    -0.46 1.00     2294
## cor(mufp_nregions12,mufp_adj_trial_id:nregions12)       0.68 1.00     2405
##                                                     Tail_ESS
## sd(mutn_Intercept)                                      4274
## sd(mutn_adj_trial_id)                                   4060
## sd(mutn_nregions12)                                     3574
## sd(mutn_adj_trial_id:nregions12)                        2233
## sd(mufn_Intercept)                                      4104
## sd(mufn_adj_trial_id)                                   4420
## sd(mufn_nregions12)                                     3662
## sd(mufn_adj_trial_id:nregions12)                        2779
## sd(mufp_Intercept)                                      4727
## sd(mufp_adj_trial_id)                                   4557
## sd(mufp_nregions12)                                     4776
## sd(mufp_adj_trial_id:nregions12)                        4271
## cor(mutn_Intercept,mutn_adj_trial_id)                   4750
## cor(mutn_Intercept,mutn_nregions12)                     4445
## cor(mutn_adj_trial_id,mutn_nregions12)                  2836
## cor(mutn_Intercept,mutn_adj_trial_id:nregions12)        4943
## cor(mutn_adj_trial_id,mutn_adj_trial_id:nregions12)     4347
## cor(mutn_nregions12,mutn_adj_trial_id:nregions12)       3174
## cor(mufn_Intercept,mufn_adj_trial_id)                   4789
## cor(mufn_Intercept,mufn_nregions12)                     4745
## cor(mufn_adj_trial_id,mufn_nregions12)                  3405
## cor(mufn_Intercept,mufn_adj_trial_id:nregions12)        4912
## cor(mufn_adj_trial_id,mufn_adj_trial_id:nregions12)     3701
## cor(mufn_nregions12,mufn_adj_trial_id:nregions12)       3705
## cor(mufp_Intercept,mufp_adj_trial_id)                   4385
## cor(mufp_Intercept,mufp_nregions12)                     4617
## cor(mufp_adj_trial_id,mufp_nregions12)                  4118
## cor(mufp_Intercept,mufp_adj_trial_id:nregions12)        4284
## cor(mufp_adj_trial_id,mufp_adj_trial_id:nregions12)     3811
## cor(mufp_nregions12,mufp_adj_trial_id:nregions12)       3871
## 
## Population-Level Effects: 
##                                                      Estimate Est.Error
## mutn_Intercept                                           0.49      0.07
## mufn_Intercept                                          -0.23      0.11
## mufp_Intercept                                          -1.78      0.11
## mutn_conditiondotplot                                   -0.23      0.10
## mutn_conditionhalfeye                                   -0.05      0.10
## mutn_conditionhops_bootstrap                             0.45      0.10
## mutn_conditionhops_mean                                  0.01      0.10
## mutn_conditionraw_data                                   0.17      0.10
## mutn_adj_trial_id                                        0.12      0.05
## mutn_nregions12                                          0.12      0.04
## mutn_conditiondotplot:adj_trial_id                      -0.05      0.08
## mutn_conditionhalfeye:adj_trial_id                      -0.05      0.08
## mutn_conditionhops_bootstrap:adj_trial_id               -0.03      0.08
## mutn_conditionhops_mean:adj_trial_id                    -0.01      0.08
## mutn_conditionraw_data:adj_trial_id                      0.10      0.08
## mutn_conditiondotplot:nregions12                        -0.03      0.06
## mutn_conditionhalfeye:nregions12                        -0.04      0.06
## mutn_conditionhops_bootstrap:nregions12                  0.03      0.06
## mutn_conditionhops_mean:nregions12                       0.01      0.06
## mutn_conditionraw_data:nregions12                       -0.06      0.06
## mutn_adj_trial_id:nregions12                             0.10      0.07
## mutn_conditiondotplot:adj_trial_id:nregions12           -0.11      0.11
## mutn_conditionhalfeye:adj_trial_id:nregions12           -0.03      0.11
## mutn_conditionhops_bootstrap:adj_trial_id:nregions12     0.21      0.11
## mutn_conditionhops_mean:adj_trial_id:nregions12          0.13      0.11
## mutn_conditionraw_data:adj_trial_id:nregions12           0.00      0.11
## mufn_conditiondotplot                                   -0.48      0.16
## mufn_conditionhalfeye                                   -0.19      0.16
## mufn_conditionhops_bootstrap                             0.77      0.16
## mufn_conditionhops_mean                                 -0.06      0.16
## mufn_conditionraw_data                                   0.30      0.16
## mufn_adj_trial_id                                        0.23      0.07
## mufn_nregions12                                          0.10      0.07
## mufn_conditiondotplot:adj_trial_id                       0.06      0.11
## mufn_conditionhalfeye:adj_trial_id                      -0.15      0.11
## mufn_conditionhops_bootstrap:adj_trial_id               -0.04      0.11
## mufn_conditionhops_mean:adj_trial_id                    -0.02      0.11
## mufn_conditionraw_data:adj_trial_id                      0.17      0.10
## mufn_conditiondotplot:nregions12                         0.01      0.10
## mufn_conditionhalfeye:nregions12                         0.03      0.10
## mufn_conditionhops_bootstrap:nregions12                  0.14      0.10
## mufn_conditionhops_mean:nregions12                       0.11      0.10
## mufn_conditionraw_data:nregions12                        0.04      0.10
## mufn_adj_trial_id:nregions12                             0.04      0.10
## mufn_conditiondotplot:adj_trial_id:nregions12           -0.13      0.16
## mufn_conditionhalfeye:adj_trial_id:nregions12            0.06      0.15
## mufn_conditionhops_bootstrap:adj_trial_id:nregions12     0.34      0.15
## mufn_conditionhops_mean:adj_trial_id:nregions12          0.28      0.15
## mufn_conditionraw_data:adj_trial_id:nregions12           0.05      0.15
## mufp_conditiondotplot                                    0.09      0.15
## mufp_conditionhalfeye                                   -0.15      0.16
## mufp_conditionhops_bootstrap                             0.14      0.15
## mufp_conditionhops_mean                                 -0.25      0.15
## mufp_conditionraw_data                                   0.04      0.15
## mufp_adj_trial_id                                       -0.28      0.10
## mufp_nregions12                                         -0.45      0.09
## mufp_conditiondotplot:adj_trial_id                       0.15      0.13
## mufp_conditionhalfeye:adj_trial_id                      -0.11      0.14
## mufp_conditionhops_bootstrap:adj_trial_id               -0.09      0.14
## mufp_conditionhops_mean:adj_trial_id                    -0.13      0.14
## mufp_conditionraw_data:adj_trial_id                     -0.17      0.14
## mufp_conditiondotplot:nregions12                         0.07      0.12
## mufp_conditionhalfeye:nregions12                        -0.01      0.12
## mufp_conditionhops_bootstrap:nregions12                  0.09      0.13
## mufp_conditionhops_mean:nregions12                      -0.02      0.13
## mufp_conditionraw_data:nregions12                        0.16      0.13
## mufp_adj_trial_id:nregions12                            -0.02      0.16
## mufp_conditiondotplot:adj_trial_id:nregions12           -0.23      0.23
## mufp_conditionhalfeye:adj_trial_id:nregions12            0.07      0.23
## mufp_conditionhops_bootstrap:adj_trial_id:nregions12     0.04      0.24
## mufp_conditionhops_mean:adj_trial_id:nregions12         -0.03      0.24
## mufp_conditionraw_data:adj_trial_id:nregions12           0.13      0.23
##                                                      l-95% CI u-95% CI Rhat
## mutn_Intercept                                           0.35     0.63 1.00
## mufn_Intercept                                          -0.45    -0.01 1.00
## mufp_Intercept                                          -1.99    -1.58 1.00
## mutn_conditiondotplot                                   -0.42    -0.03 1.00
## mutn_conditionhalfeye                                   -0.26     0.15 1.00
## mutn_conditionhops_bootstrap                             0.24     0.65 1.00
## mutn_conditionhops_mean                                 -0.20     0.21 1.00
## mutn_conditionraw_data                                  -0.03     0.36 1.00
## mutn_adj_trial_id                                        0.01     0.23 1.00
## mutn_nregions12                                          0.03     0.21 1.00
## mutn_conditiondotplot:adj_trial_id                      -0.20     0.10 1.00
## mutn_conditionhalfeye:adj_trial_id                      -0.20     0.10 1.00
## mutn_conditionhops_bootstrap:adj_trial_id               -0.19     0.12 1.00
## mutn_conditionhops_mean:adj_trial_id                    -0.16     0.14 1.00
## mutn_conditionraw_data:adj_trial_id                     -0.06     0.26 1.00
## mutn_conditiondotplot:nregions12                        -0.15     0.09 1.00
## mutn_conditionhalfeye:nregions12                        -0.16     0.08 1.00
## mutn_conditionhops_bootstrap:nregions12                 -0.10     0.15 1.00
## mutn_conditionhops_mean:nregions12                      -0.10     0.13 1.00
## mutn_conditionraw_data:nregions12                       -0.18     0.06 1.00
## mutn_adj_trial_id:nregions12                            -0.05     0.25 1.00
## mutn_conditiondotplot:adj_trial_id:nregions12           -0.32     0.09 1.00
## mutn_conditionhalfeye:adj_trial_id:nregions12           -0.24     0.18 1.00
## mutn_conditionhops_bootstrap:adj_trial_id:nregions12    -0.01     0.43 1.00
## mutn_conditionhops_mean:adj_trial_id:nregions12         -0.08     0.34 1.00
## mutn_conditionraw_data:adj_trial_id:nregions12          -0.22     0.22 1.00
## mufn_conditiondotplot                                   -0.79    -0.16 1.00
## mufn_conditionhalfeye                                   -0.50     0.14 1.00
## mufn_conditionhops_bootstrap                             0.45     1.09 1.00
## mufn_conditionhops_mean                                 -0.37     0.27 1.00
## mufn_conditionraw_data                                  -0.02     0.61 1.00
## mufn_adj_trial_id                                        0.08     0.37 1.00
## mufn_nregions12                                         -0.04     0.23 1.00
## mufn_conditiondotplot:adj_trial_id                      -0.16     0.28 1.00
## mufn_conditionhalfeye:adj_trial_id                      -0.35     0.07 1.00
## mufn_conditionhops_bootstrap:adj_trial_id               -0.24     0.17 1.00
## mufn_conditionhops_mean:adj_trial_id                    -0.22     0.18 1.00
## mufn_conditionraw_data:adj_trial_id                     -0.03     0.38 1.00
## mufn_conditiondotplot:nregions12                        -0.18     0.21 1.00
## mufn_conditionhalfeye:nregions12                        -0.17     0.22 1.00
## mufn_conditionhops_bootstrap:nregions12                 -0.05     0.34 1.00
## mufn_conditionhops_mean:nregions12                      -0.09     0.30 1.00
## mufn_conditionraw_data:nregions12                       -0.15     0.24 1.00
## mufn_adj_trial_id:nregions12                            -0.17     0.24 1.00
## mufn_conditiondotplot:adj_trial_id:nregions12           -0.44     0.18 1.00
## mufn_conditionhalfeye:adj_trial_id:nregions12           -0.24     0.37 1.00
## mufn_conditionhops_bootstrap:adj_trial_id:nregions12     0.05     0.63 1.00
## mufn_conditionhops_mean:adj_trial_id:nregions12         -0.01     0.58 1.00
## mufn_conditionraw_data:adj_trial_id:nregions12          -0.25     0.33 1.00
## mufp_conditiondotplot                                   -0.20     0.39 1.00
## mufp_conditionhalfeye                                   -0.45     0.16 1.00
## mufp_conditionhops_bootstrap                            -0.17     0.44 1.00
## mufp_conditionhops_mean                                 -0.54     0.05 1.00
## mufp_conditionraw_data                                  -0.26     0.34 1.00
## mufp_adj_trial_id                                       -0.47    -0.10 1.00
## mufp_nregions12                                         -0.63    -0.28 1.00
## mufp_conditiondotplot:adj_trial_id                      -0.10     0.42 1.00
## mufp_conditionhalfeye:adj_trial_id                      -0.37     0.16 1.00
## mufp_conditionhops_bootstrap:adj_trial_id               -0.37     0.19 1.00
## mufp_conditionhops_mean:adj_trial_id                    -0.42     0.15 1.00
## mufp_conditionraw_data:adj_trial_id                     -0.44     0.11 1.00
## mufp_conditiondotplot:nregions12                        -0.17     0.31 1.00
## mufp_conditionhalfeye:nregions12                        -0.26     0.23 1.00
## mufp_conditionhops_bootstrap:nregions12                 -0.16     0.34 1.00
## mufp_conditionhops_mean:nregions12                      -0.27     0.23 1.00
## mufp_conditionraw_data:nregions12                       -0.08     0.40 1.00
## mufp_adj_trial_id:nregions12                            -0.31     0.29 1.00
## mufp_conditiondotplot:adj_trial_id:nregions12           -0.68     0.22 1.00
## mufp_conditionhalfeye:adj_trial_id:nregions12           -0.37     0.53 1.00
## mufp_conditionhops_bootstrap:adj_trial_id:nregions12    -0.43     0.51 1.00
## mufp_conditionhops_mean:adj_trial_id:nregions12         -0.50     0.43 1.00
## mufp_conditionraw_data:adj_trial_id:nregions12          -0.32     0.59 1.00
##                                                      Bulk_ESS Tail_ESS
## mutn_Intercept                                           2246     3336
## mufn_Intercept                                           1712     2730
## mufp_Intercept                                           3406     4156
## mutn_conditiondotplot                                    2259     3635
## mutn_conditionhalfeye                                    2282     3469
## mutn_conditionhops_bootstrap                             2407     3246
## mutn_conditionhops_mean                                  2276     3232
## mutn_conditionraw_data                                   2519     3258
## mutn_adj_trial_id                                        3857     4485
## mutn_nregions12                                          4418     4394
## mutn_conditiondotplot:adj_trial_id                       4348     4897
## mutn_conditionhalfeye:adj_trial_id                       4226     4381
## mutn_conditionhops_bootstrap:adj_trial_id                4334     4376
## mutn_conditionhops_mean:adj_trial_id                     4192     4573
## mutn_conditionraw_data:adj_trial_id                      4244     4788
## mutn_conditiondotplot:nregions12                         4500     4465
## mutn_conditionhalfeye:nregions12                         4586     4781
## mutn_conditionhops_bootstrap:nregions12                  4400     4738
## mutn_conditionhops_mean:nregions12                       4441     4735
## mutn_conditionraw_data:nregions12                        4745     4445
## mutn_adj_trial_id:nregions12                             3750     4353
## mutn_conditiondotplot:adj_trial_id:nregions12            4086     4578
## mutn_conditionhalfeye:adj_trial_id:nregions12            3796     4728
## mutn_conditionhops_bootstrap:adj_trial_id:nregions12     4208     4737
## mutn_conditionhops_mean:adj_trial_id:nregions12          4287     4764
## mutn_conditionraw_data:adj_trial_id:nregions12           3981     4793
## mufn_conditiondotplot                                    1891     3126
## mufn_conditionhalfeye                                    2035     3402
## mufn_conditionhops_bootstrap                             1882     3449
## mufn_conditionhops_mean                                  1676     2988
## mufn_conditionraw_data                                   1797     3103
## mufn_adj_trial_id                                        3839     4733
## mufn_nregions12                                          4164     3854
## mufn_conditiondotplot:adj_trial_id                       4385     4912
## mufn_conditionhalfeye:adj_trial_id                       4035     4755
## mufn_conditionhops_bootstrap:adj_trial_id                4315     4507
## mufn_conditionhops_mean:adj_trial_id                     4538     4912
## mufn_conditionraw_data:adj_trial_id                      4065     4793
## mufn_conditiondotplot:nregions12                         4238     4487
## mufn_conditionhalfeye:nregions12                         4569     4951
## mufn_conditionhops_bootstrap:nregions12                  4196     4604
## mufn_conditionhops_mean:nregions12                       4175     4654
## mufn_conditionraw_data:nregions12                        4340     4890
## mufn_adj_trial_id:nregions12                             4099     4635
## mufn_conditiondotplot:adj_trial_id:nregions12            4338     4603
## mufn_conditionhalfeye:adj_trial_id:nregions12            4278     4823
## mufn_conditionhops_bootstrap:adj_trial_id:nregions12     4302     4747
## mufn_conditionhops_mean:adj_trial_id:nregions12          4438     4680
## mufn_conditionraw_data:adj_trial_id:nregions12           4150     4525
## mufp_conditiondotplot                                    3115     3809
## mufp_conditionhalfeye                                    3554     4413
## mufp_conditionhops_bootstrap                             3857     4489
## mufp_conditionhops_mean                                  3440     4118
## mufp_conditionraw_data                                   3344     4038
## mufp_adj_trial_id                                        3929     4739
## mufp_nregions12                                          3923     4503
## mufp_conditiondotplot:adj_trial_id                       4418     4735
## mufp_conditionhalfeye:adj_trial_id                       4421     4828
## mufp_conditionhops_bootstrap:adj_trial_id                4554     4714
## mufp_conditionhops_mean:adj_trial_id                     4543     4879
## mufp_conditionraw_data:adj_trial_id                      4193     4581
## mufp_conditiondotplot:nregions12                         4245     4664
## mufp_conditionhalfeye:nregions12                         4033     4515
## mufp_conditionhops_bootstrap:nregions12                  4353     4552
## mufp_conditionhops_mean:nregions12                       4420     4549
## mufp_conditionraw_data:nregions12                        4489     4733
## mufp_adj_trial_id:nregions12                             4587     4805
## mufp_conditiondotplot:adj_trial_id:nregions12            4864     4734
## mufp_conditionhalfeye:adj_trial_id:nregions12            4622     4946
## mufp_conditionhops_bootstrap:adj_trial_id:nregions12     4750     5080
## mufp_conditionhops_mean:adj_trial_id:nregions12          5050     4909
## mufp_conditionraw_data:adj_trial_id:nregions12           4766     4866
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Model diagnostics

Before we show the results from the model, we first run some posterior predictive checks to make sure that the model is able to recover the actual data.

As the posterior predictive checks look good for all our primary population-level parameters, we examine the results more closely. Before visualising the model predictions, we need to extract posterior samples from the fitted model object:
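
A sketch of this extraction with tidybayes (the prediction grid below is an assumption about the analysis; the six condition levels and two nregions levels come from the tables above):

```r
# Prediction grid for the population-level ("average participant")
# estimates: one cell per condition x nregions, at the mean trial id
newdata <- expand.grid(
  condition = c("ci", "dotplot", "halfeye", "hops_bootstrap",
                "hops_mean", "raw_data"),
  nregions = factor(c(8, 12)),
  adj_trial_id = 0,
  ntrials = 1
)

# Not run here (requires the fitted `fit` object):
# draws <- newdata %>%
#   tidybayes::add_fitted_draws(fit, re_formula = NA)
```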

RQ 1: do users implicitly perform some form of multiple comparisons correction?

If users do not perform any form of multiple comparisons correction, then intuitively, on average, a participant in our study will make more false positives when presented with 12 graphs as opposed to 8 (the two within-subjects conditions). More directly, we can compare the false discovery rate when \(nregions = 8\) vs. when \(nregions = 12\). If the FDR is constant or lower for \(nregions = 12\) compared to \(nregions = 8\), it implies that participants are performing some form of multiple comparisons correction.

First, we need to calculate the false discovery rate without any multiple comparisons correction:
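
A sketch of how these two baselines (uncorrected vs. Benjamini-Hochberg) differ, using invented p-values for eight hypothetical region comparisons:

```r
# Invented p-values for 8 hypothetical region comparisons
p <- c(0.001, 0.012, 0.030, 0.045, 0.210, 0.480, 0.730, 0.940)

# Uncorrected: reject every hypothesis with p < alpha
uncorrected <- p < 0.05

# Benjamini-Hochberg: adjust the p-values first, then reject at alpha
bh <- p.adjust(p, method = "BH") < 0.05

sum(uncorrected)  # 4 rejections without correction
sum(bh)           # 2 rejections after BH correction
```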

FDRs for combinations of method (uncorrected and Benjamini-Hochberg) and number of regions/hypotheses (8 or 12)

## # A tibble: 4 x 3
##   ntrials method         FDR
##     <int> <chr>        <dbl>
## 1       8 bh          0.0353
## 2       8 uncorrected 0.0934
## 3      12 bh          0.0164
## 4      12 uncorrected 0.0321

We compute average marginal effects (AMEs) for nregions and condition from the posterior retrodictive distribution, and compare them against the simulated FDRs. Colors from https://coolors.co/393d3f-c98ca7-e76d83-f5b700-a5c4d4

Colors: https://coolors.co/27187e-758bfd-aeb8fe-f5b700-f1f2f6

Individual plot for FP

In the figure below, we see that, on average, the FDR decreases for participants when presented with more graphs. This suggests that they are likely performing some form of implicit multiple comparisons correction.

## # A tibble: 2 x 7
##   ntrials fp_rate .lower .upper .width .point .interval
##     <int>   <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1       8   0.145 0.134   0.157   0.95 median qi       
## 2      12   0.103 0.0935  0.114   0.95 median qi

We see that the decrease in the False Discovery Rate is, on average, approximately 4 percentage points, with a 95% credible interval of [0.033, 0.051]. This corresponds to an almost 30% relative reduction in the FDR.
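Posterior contrasts like this are computed by differencing draw by draw. A minimal self-contained illustration, with simulated Beta draws standing in for the model's actual posterior (the real analysis subtracts the fitted model's draws):

```r
set.seed(1)
# Simulated stand-ins for posterior FDR draws, centred near the estimates
# above (0.145 for nregions = 8, 0.103 for nregions = 12)
draws_8  <- rbeta(4000, 145, 855)
draws_12 <- rbeta(4000, 103, 897)

# The posterior of the decrease is the draw-wise difference
decrease <- draws_8 - draws_12
quantile(decrease, probs = c(0.025, 0.5, 0.975))
```

Summarising `decrease` with its median and 2.5%/97.5% quantiles yields the point estimate and 95% credible interval reported above.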

Next, since we have different visualization conditions, we inspect whether this difference persists across all of them. From the figure below, we see that the FDR decreases consistently across all the uncertainty representations, suggesting that the effect is likely not an artifact of particular forms of visual representation. The magnitude of the decrease also appears consistent across the different uncertainty representations.

Thus, our results suggest that users in our experimental setup implicitly perform some form of multiple comparisons correction. Because our experimental design incentivised participants against making False Discoveries, proportionate to performing an NHST at 95% confidence, we cannot tell whether participants would always behave this way. We believe that in the absence of such incentives, participants may not control for False Positives in a similar manner, as suggested by the results of the study by Zgraggen et al.

RQ 2: do uncertainty representations affect the False Discovery Rate?

To answer this question, we first look at the FDR across the different uncertainty representations, marginalised over the number of regions shown to participants. This gives us the aggregate effect over the two within-subjects conditions in the study.

From the figure above, we can see that, on average, uncertainty representations such as Hypothetical Outcome Plots (HOPs) of the mean difference and Probability Density Functions of the difference decrease the FDR, with observed decreases of approximately 4 and 3 percentage points respectively (95% CI: [-0.072, -0.005] and [-0.066, 0.002] respectively). Other commonly used uncertainty representations, such as Confidence Intervals, appear to have a small but unreliable effect towards decreasing the FDR (~ 1.5 percentage points, 95% CI: [-0.051, 0.018]). On the other hand, certain other uncertainty representations, such as dotplots of the mean difference and HOPs of bootstrapped data samples, appear to provide no improvement or even slightly worsen the FDR (the exact estimates are shown in the table below).

## # A tibble: 5 x 7
##   condition                  fp_rate  .lower   .upper .width .point .interval
##   <chr>                        <dbl>   <dbl>    <dbl>  <dbl> <chr>  <chr>    
## 1 ci - raw_data             -0.0150  -0.0506  0.0176    0.95 median qi       
## 2 dotplot - raw_data        -0.00300 -0.0404  0.0328    0.95 median qi       
## 3 halfeye - raw_data        -0.0311  -0.0657  0.00208   0.95 median qi       
## 4 hops_bootstrap - raw_data  0.00770 -0.0307  0.0475    0.95 median qi       
## 5 hops_mean - raw_data      -0.0382  -0.0719 -0.00541   0.95 median qi

Are these differences consistent across the number of regions shown?

Based on the plots below, we find that the differences are fairly consistent across the within-subjects manipulation.

Calculation of marginalised estimates

Learning effects?

Before we compare the composite scores, we look at the potential learning effects in our primary research questions. In repeated measures experimental designs such as this, where we also provide participants feedback (in the first 5 trials in each block), we might expect to see some learning effect (or at least variation in the responses over the course of the trials). In the figure below, we plot the change in the probability of TP/TN/FP/FN in each condition.

From the figure below, we see that our participants still perform some form of implicit multiple comparisons correction, and this persists even after accounting for the potential effect of learning.

Next we test if there were any effects of learning on the differences between the uncertainty representations. In other words, does the effect of learning dominate over the effect of the uncertainty representation?

From the figure below, we can see that the effect persists for the uncertainty representations that reduce the FDR (the probability density plot and HOPs of the mean difference), although the magnitude of the mean effect is smaller by around 1 percentage point. This indicates that certain forms of uncertainty representations are reliably better at reducing the FDR.

Exploratory analysis

We first take a look at the probability of TP/TN/FP/FN in each uncertainty representation condition, marginalised over \(nregions\). Interestingly, we find little difference in the probability of an average user making a False Positive on an average trial, even for some of the uncertainty representations with better FDR; except for the dotplot condition, all the other uncertainty representations appear comparable to the baseline (raw data) in this respect. The improvement in FDR instead arises mainly from analysts identifying True Positives more accurately, where we see large differences. The probability of a False Negative also varies substantially across conditions, with HOPs of bootstrapped samples performing worse and all the other uncertainty representations performing better than the baseline (dotplot performing best). In contrast, there is little or no difference in the probability of making a True Negative.

Because it is difficult to compare the rates of TP / TN / FP / FN across conditions directly, we use composite metrics developed in ML, such as F-scores and the Matthews Correlation Coefficient. Another way of comparing the different conditions is to use the payout, which served as the incentive for participants.

Comparison of the different conditions using composite metrics (F-scores, MCC and Payout)

F-scores are a common metric used in ML to obtain a composite score for the performance of an algorithm, taking into account the number of True Positives, False Positives and False Negatives. It is given by \(\text{F-score} = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{2TP}{2TP + FN + FP}\). We can use this to compare performance across the different uncertainty representation conditions.
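The formula translates directly into a small helper function (illustrative, not the analysis code itself):

```r
# F-score from confusion-matrix counts, per the formula above
f_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# The two forms of the formula agree, e.g. with tp = 10, fp = 2, fn = 3:
f_score(10, 2, 3)          # 0.8
2 * 10 / (2 * 10 + 3 + 2)  # 0.8
```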

In this analysis, we compare the F-scores of an average user to those of the BH procedure (which we consider optimal for this task).

First, we compare the difference in F-scores when we manipulate \(nregions\), i.e. the number of graphs presented to participants (8 or 12). We see that F-scores actually decrease when participants are presented with more graphs, possibly because the decrease in FDR also entails a decrease in the number of True Positives.

Estimate of f-scores for each region \(\in\) {8, 12}.

## # A tibble: 2 x 7
##   ntrials fscore .lower .upper .width .point .interval
##     <int>  <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1       8  0.656  0.637  0.675   0.95 median qi       
## 2      12  0.637  0.616  0.659   0.95 median qi

Visualisation of the probability distribution of f-scores for each region \(\in\) {8, 12}.

Next, comparing the difference in F-scores between the uncertainty representations, we see that all of them, except HOPs of bootstrapped data samples, reliably increase F-scores. In other words, accuracy increases when these uncertainty representations are used. Interestingly, the dotplot results in the highest accuracy (an improvement of almost 15 percentage points, 95% CI: [0.08, 0.21]) compared to the baseline. Other uncertainty representations, such as probability density plots, HOPs of the mean difference and 95% confidence intervals of the mean difference, also improve F-scores.

The following table summarises the differences in F-scores between each condition and the baseline.

## # A tibble: 5 x 7
##   condition                  fscore   .lower  .upper .width .point .interval
##   <chr>                       <dbl>    <dbl>   <dbl>  <dbl> <chr>  <chr>    
## 1 ci - raw_data              0.0704  0.00753  0.134    0.95 median qi       
## 2 dotplot - raw_data         0.145   0.0826   0.207    0.95 median qi       
## 3 halfeye - raw_data         0.108   0.0410   0.174    0.95 median qi       
## 4 hops_bootstrap - raw_data -0.115  -0.194   -0.0403   0.95 median qi       
## 5 hops_mean - raw_data       0.0753  0.00431  0.145    0.95 median qi

Matthews Correlation Coefficient: One common drawback of the F-score is that it does not take into account True Negatives. The Matthews Correlation Coefficient is a proposed measure to address this limitation, and is calculated as \(MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\).
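As with the F-score, the MCC formula maps directly onto a small helper (illustrative, not the analysis code itself):

```r
# Matthews Correlation Coefficient from confusion-matrix counts
mcc <- function(tp, tn, fp, fn) {
  num <- tp * tn - fp * fn
  den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  num / den
}

mcc(5, 5, 0, 0)  # 1: perfect classification
mcc(1, 1, 1, 1)  # 0: chance-level performance
```

Unlike the F-score, the MCC uses all four cells of the confusion matrix, ranging from -1 (total disagreement) through 0 (chance) to 1 (perfect prediction).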

In our study, we incentivise participants using a payout scheme, so it may be the case that participants are optimising for the incentives provided. Hence, we compare the average payout across the different conditions. Based on this measure, we see that participants in the probability density plot and HOPs of the mean difference conditions have, on average, higher payouts.

## # A tibble: 36 x 4
## # Groups:   ntrials, condition [12]
##    ntrials condition greater_than      p
##      <int> <fct>     <chr>         <dbl>
##  1       8 raw_data  bh           0     
##  2       8 raw_data  uncorrected  0.255 
##  3       8 raw_data  zero         0     
##  4       8 ci        bh           0     
##  5       8 ci        uncorrected  0.282 
##  6       8 ci        zero         0     
##  7       8 dotplot   bh           0     
##  8       8 dotplot   uncorrected  0.0342
##  9       8 dotplot   zero         0     
## 10       8 halfeye   bh           0     
## # … with 26 more rows
## # A tibble: 2 x 3
##   method      ntrials payout
##   <fct>         <int>  <dbl>
## 1 uncorrected       8  -53.7
## 2 uncorrected      12   50.3

Power analysis